The setting of standards involves subjective value
judgments. The inherent arbitrariness of specific standards has been
severely criticized by Glass. His antagonists agree that standard
setting is a judgmental task but they have pointed out that
arbitrariness in the positive sense of serious judgmental decisions
is unavoidable. Further, small misplacements of the standard
therefore can be considered inconsequential. In this paper, the
uncertainty with respect to the 'true' standard is quantified and the
consequences of the specification of the uncertainty on the optimum
passing score are studied. In a second approach the assumption is
made that the standard setters not only have information with respect
to the position of the standard, but also relative information with
respect to the level of the target group of examinees. This
information can be used in the final setting of the standard. If
performance is lower than expected, one does better by lowering the
standard a certain amount by means of a preconceived strategy.
(Author/BL)
ACCOUNTING FOR THE UNCERTAINTY IN PERFORMANCE STANDARDS
Dato N.M. de Gruijter, University of Leyden

Abstract

The setting of standards involves subjective value judgments. The inherent
arbitrariness of specific standards has been severely criticized by Glass
in a special issue on standard setting in the 1978 volume of the Journal of
Educational Measurement. His antagonists agree that standard setting is a
judgmental task but they have pointed out that arbitrariness in the
positive sense of serious judgmental decisions is unavoidable. Further, small
misplacements of the standard therefore can be considered inconsequential.

The point of view in this paper is that the argumentation should not remain 
on the verbal level only, i.e. it is proposed to quantify the uncertainty 
with respect to the 'true' standard and to study the consequences of the
specification of the uncertainty on the optimum passing score. 

In a second approach the reasonable assumption is made that the standard
setters not only have information with respect to the position of the
standard, but also relative information, i.e. information with respect to
the level of the target group of examinees. This information can be used in
the final setting of the standard. If e.g. performance is lower than
expected, one does better by lowering the standard a certain amount by means
of a preconceived strategy.
Accounting for uncertainty in performance standards

Dato N.M. de Gruijter
Educational Research Center
University of Leyden

Introduction

In the decision-theoretic approach to mastery decisions utilities or losses
for passing and failing are defined as functions of domain scores. For each
examinee an observed score is obtained and that decision which results in
the smaller expected loss is made.

The utility or loss structure uses - explicitly or implicitly - the concept
of a standard of performance. In threshold loss e.g., the loss associated
with failing an examinee is larger than zero for domain scores
$\pi \ge \pi_0$, where $\pi_0$ is the performance standard, and equals zero
for $\pi < \pi_0$. In more realistic continuous utility functions one might
call the break-even value, the value of $\pi$ for which the utilities for
passing and failing are equal, the standard.

In the early decision-theoretic literature on mastery testing the standard
is taken for granted. In more recent years one sees attempts by
psychometricians to develop methods specifying utilities, and so implicitly
standards, more carefully (see Novick and Lindley, 1978). The more advanced
techniques, however, require subjective value judgments, which introduce a
source of mistakes. This may become apparent when experts disagree with
respect to the specification of utilities.

Glass (1978), in a special issue on standard setting in the Journal of
Educational Measurement, severely criticizes the standard setting approach in
testing, referring to the arbitrariness of standards; and some contemporary
practices may serve as examples demonstrating that he is not entirely
wrong. His opponents (e.g. Popham, 1978) agree that standard setting is a
judgmental task, but they do point out that arbitrariness, in the positive
sense of serious judgmental decision making, is unavoidable. Block (1978),
referring to Schwab (1969), suggests in connection with this discussion
that one cannot expect right solutions, but at best defensible solutions.

Paper presented at the Fourth International Symposium on Educational
Testing, Antwerp, June 24-27, 1980.



Moreover, Scriven (1978) reminds us that there is a range of domain scores
for which passing or failing does not make much difference; this is
reflected by realistic utility functions. Therefore small misplacements of
the standard may be considered inconsequential.

Although uncertainty with respect to the adequacy of a given standard is
generally admitted, it has not been given a formal treatment within the
decision-theoretic approach to mastery testing. One of the aims of this
paper is to provide such a treatment in connection with threshold loss.

Part of the uncertainty with respect to the most adequate position of the
standard may be due to the fact that a readily available type of
information, normative information, is often neglected. The use of normative
information seems incompatible with the philosophy of mastery testing. It
will be argued, however, that in fact it is not. How normative information
may be used in standard setting, will be demonstrated.

Threshold loss and uncertainty in standard setting

Let us assume that a standard has been set. Further, let us assume that
threshold loss represents an adequate approximation to losses due to
classification errors, i.e. the following loss structure is used:

              $\pi < \pi_0$     $\pi \ge \pi_0$

    pass        $L_{01}$             0
    fail           0              $L_{10}$

with $L_{01}, L_{10} > 0$. In fact, one only needs to determine the loss ratio
$L = L_{01}/L_{10}$. Therefore, in the following $L_{01}$ is replaced by $L$,
and $L_{10}$ by 1. Let us further assume that each examinee answers the same
number of items $n$ and that the binomial error model holds. This means that
the observed score $x$ is a sufficient statistic for $\pi$.

Examinee p should be passed if the expected loss on failing (the
probability of mastery times the loss due to failing a master) exceeds the
expected loss on passing:

(1)   $P(\pi_p \ge \pi_0 \mid x_p) > L \, P(\pi_p < \pi_0 \mid x_p)$.

The opposite decision is taken in case the inequality sign is reversed, while
we are indifferent with respect to the action taken in case of an equality
sign. The conditional probabilities in (1) can be computed by the
application of Bayes' theorem in case the distribution of $\pi$ is known. The
first score $x$ for which (1) is satisfied, is the optimal cutting score.

In the binomial error model the error variance varies as a function of
$\pi$, a troublesome characteristic in some analyses. Therefore, in this paper
transformed domain scores $\gamma = \sin^{-1}\sqrt{\pi}$ are used instead: this
inverse sine transformation is a variance-stabilizing transformation (see
e.g. Novick et al., 1973). For observed proportions the following
transformation is chosen:

$g_p = \sin^{-1}\sqrt{(x_p + 3/8)/(n + 3/4)}$.

The distribution of $g_p$ can be approximated by
$N(\gamma_p, (4n+2)^{-1})$, which implies that the error variance on the
transformed scale is independent of $\gamma_p$.
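The transformation can be checked numerically with a few lines of Python. This is only a sketch: the function names are illustrative and not from the paper, and the test length n = 19 is borrowed from the example used later in the text.

```python
import math


def arcsine_score(x, n):
    """Transformed observed proportion g = asin(sqrt((x + 3/8) / (n + 3/4)))."""
    return math.asin(math.sqrt((x + 0.375) / (n + 0.75)))


def error_variance(n):
    """Approximate error variance of g on the transformed scale, 1 / (4n + 2)."""
    return 1.0 / (4 * n + 2)


if __name__ == "__main__":
    n = 19  # assumed test length, taken from the later example
    for x in (15, 16, 17):
        print(x, round(arcsine_score(x, n), 3))
    print("error variance:", round(error_variance(n), 4))
```

Note that the error variance depends on the test length only, which is the point of working on the transformed scale.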

Assuming that $\gamma$ is approximately $N(\mu_\gamma, \phi_\gamma)$ and the
sample size is large - which means that $\mu_\gamma$ and $\phi_\gamma$ are
accurately estimated - the posterior distribution of the transformed domain
score of examinee p, $\gamma_p$, is approximately normal with a mean equal to

(2)   $\hat{\gamma}_p = \rho g_p + (1 - \rho)\mu_\gamma$,

where

(3)   $\rho = \phi_\gamma / [\phi_\gamma + (4n+2)^{-1}]$,

and a variance equal to

(4)   $\phi = \rho (4n+2)^{-1}$.

Equation (2) is a Kelley estimate of examinee p's transformed domain score.
Using inequality (1), the criterion for passing becomes

(5)   $1 - \Phi[\phi^{-1/2}\{0 - (\hat{\gamma}_p - \gamma_0)\}] > L \, \Phi[\phi^{-1/2}\{0 - (\hat{\gamma}_p - \gamma_0)\}]$,

where $\Phi$ is the cumulative normal distribution and
$\gamma_0 = \sin^{-1}\sqrt{\pi_0}$ (cf. Equation 1).
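A minimal Python sketch of the passing criterion for a fixed standard follows, assuming the normal approximations above hold. The function names are illustrative rather than the author's; the numerical values in the usage note (population mean 1.128, population variance .011, n = 19, L = 2, standard .80) are those of the worked example given later in the paper.

```python
import math
from statistics import NormalDist


def arcsine_score(x, n):
    """Transformed observed proportion g for raw score x on an n-item test."""
    return math.asin(math.sqrt((x + 0.375) / (n + 0.75)))


def kelley_posterior(g, mu_gamma, phi_gamma, n):
    """Posterior mean and variance of the transformed domain score."""
    err = 1.0 / (4 * n + 2)                      # error variance of g
    rho = phi_gamma / (phi_gamma + err)          # regression weight
    return rho * g + (1.0 - rho) * mu_gamma, rho * err


def passes(g, gamma_0, loss_ratio, mu_gamma, phi_gamma, n):
    """True if P(master | g) > L * P(non-master | g) for a fixed standard."""
    mean, var = kelley_posterior(g, mu_gamma, phi_gamma, n)
    p_nonmaster = NormalDist().cdf((gamma_0 - mean) / math.sqrt(var))
    return 1.0 - p_nonmaster > loss_ratio * p_nonmaster


def optimal_cutting_score(gamma_0, loss_ratio, mu_gamma, phi_gamma, n):
    """Lowest raw score for which the criterion holds (None if there is none)."""
    for x in range(n + 1):
        if passes(arcsine_score(x, n), gamma_0, loss_ratio, mu_gamma, phi_gamma, n):
            return x
    return None
```

Under these assumptions, `optimal_cutting_score(math.asin(math.sqrt(0.80)), 2.0, 1.128, 0.011, 19)` returns 17, in line with the cutting score reported for the example.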

Now we are ready to introduce uncertainty with respect to $\pi_0$ and, for
that matter, $\gamma_0$. Let us assume that this uncertainty may be reflected
by a normal distribution for $\gamma_0$, $N(\mu_0, \phi_0)$.

Here $\mu_0$ may be the standard agreed upon and $\phi_0$ is determined in
such a way that it reflects the amount of uncertainty with respect to the
correct value of $\gamma_0$. The distribution of $\gamma_p - \gamma_0$
determines the optimal decision (passing or failing). One obtains for this
distribution

(6)   $\gamma_p - \gamma_0 \sim N(\hat{\gamma}_p - \mu_0, \phi + \phi_0)$.

The criterion for passing becomes

(7)   $1 - \Phi[(\phi + \phi_0)^{-1/2}\{0 - (\hat{\gamma}_p - \mu_0)\}] > L \, \Phi[(\phi + \phi_0)^{-1/2}\{0 - (\hat{\gamma}_p - \mu_0)\}]$.



Of course one has to be aware of the fact that in the above derivation of
the distribution of $\gamma_p - \gamma_0$ errors in the estimation of
$\gamma_p - \gamma_0$ for different persons p are correlated due to the common
variation in $\gamma_0$. In Figure 1 both cases, the case of variable
$\gamma_0$ and the case of fixed $\gamma_0$, are displayed.

For a loss ratio equal to one decisions are not affected by the
introduction of uncertainty in $\gamma_0$; one obtains indifference between
the decision to pass and the decision to fail for $\hat{\gamma}_p = \mu_0$
(or $\hat{\gamma}_p = \gamma_0$). For other loss ratios, however, the
introduction of uncertainty in $\gamma_0$ may make a difference.
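The effect of the loss ratio can be made concrete with a short Python sketch of the criterion for an uncertain standard. The names are illustrative, and the values used in the usage note (posterior variance .006, standard 1.107, variance of the standard .0025) are the rounded figures from the example later in the paper.

```python
import math
from statistics import NormalDist


def passes_uncertain(post_mean, post_var, mu_0, phi_0, loss_ratio):
    """Passing criterion when the standard itself is N(mu_0, phi_0)."""
    z = (0.0 - (post_mean - mu_0)) / math.sqrt(post_var + phi_0)
    p_nonmaster = NormalDist().cdf(z)  # P(gamma_p - gamma_0 < 0)
    return 1.0 - p_nonmaster > loss_ratio * p_nonmaster


def indifference_point(mu_0, post_var, phi_0, loss_ratio):
    """Posterior mean at which passing and failing have equal expected loss."""
    # At indifference 1 - p = L * p, so P(non-master) = 1 / (1 + L).
    z = NormalDist().inv_cdf(1.0 / (1.0 + loss_ratio))
    return mu_0 - z * math.sqrt(post_var + phi_0)
```

With `loss_ratio=1.0` the indifference point equals `mu_0` whatever the value of `phi_0`; with `loss_ratio=2.0` it moves upward as `phi_0` grows, so a more uncertain standard asks for a higher posterior mean before passing.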

Making optimal decisions by the minimization of expected loss under
uncertainty with respect to $\gamma_0$, is only one side of the problem. It is
also important to study the effects of different possible values of
$\gamma_0$ on decision making. This calls for a robustness study (Vijn, 1980)
in which intervals for $\gamma_0$ characterized by the same optimal cutting
score are computed. In fact, it would be even better to study the joint
effect of variation in $\gamma_0$ and $L$ on the optimal cutting score since
in this way one also makes allowance for uncertainty with respect to the
proper choice of $L$. This results in regions in the two-dimensional
($\gamma_0$, $L$) space corresponding to particular values for the cutting
score.

The idea to vary $L$ is not a new one in the decision-theoretic approach to
mastery testing. Some authors have presented their data in such a way that
the consequences of values of $L$ other than the one preferred by these
authors, can be determined. This means that the reader who prefers another
loss ratio, may examine the results from his or her own point of view.
Furthermore, such a presentation is useful in case of uncertainty with
respect to $L$, as has been suggested above. A nice example of such a
presentation is an article by Huynh (1977).

In the following example of a robustness study in which $\gamma_0$ and $L$
are varied, I shall use Mellenbergh et al.'s (1977) data from 184 examinees
on a 19 item mastery test. Their data can be fitted by a normal distribution
for g with mean

$\bar{g} = \hat{\mu}_\gamma = 1.128$

and variance

$s_g^2 = \hat{\sigma}_g^2 = .024$.



The variance of $\gamma$ equals $\phi_\gamma = .011$, the posterior variance
of $\gamma_p$ equals $\phi = .006$ (estimates are assumed to equal the true
values here). Using a loss ratio equal to 2 and a standard $\gamma_0$ equal
to 1.107 (corresponding to $\pi_0 = .80$), the optimal cutting score equals
g = 1.217 (corresponding to an observed score of 17). The same optimal
cutting score is obtained if the fixed $\gamma_0$ is replaced by the
distribution N(1.107, .0025). The results were obtained replacing the
cumulative normal distribution by the closely related cumulative logistic
distribution

$\Psi(t) = \exp(1.7t) / [1 + \exp(1.7t)]$,

which means that for fixed $\gamma_0$ indifference between passing and
failing obtains if

(8a)   $1.7 \, \phi^{-1/2} (\hat{\gamma}_p - \gamma_0) = \log L$

and for $\gamma_0$ as a variable if

(8b)   $1.7 \, (\phi + \phi_0)^{-1/2} (\hat{\gamma}_p - \mu_0) = \log L$.

I have chosen the scale factor 1.7 which generally is used; with this scale
factor, the cumulative logistic differs by less than 0.01 from the
cumulative standard normal distribution for all values t. Molenaar (1974)
demonstrates that the scale factor 1.6 brings the densities in close
agreement while for |t| < 1 the agreement between the cumulative
distributions is close.
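The quality of the logistic approximation and the resulting indifference points can be explored with the Python sketch below. It is a rough check on a finite grid, not a proof, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def logistic_cdf(t, scale=1.7):
    """Cumulative logistic distribution with the given scale factor."""
    return 1.0 / (1.0 + math.exp(-scale * t))


def max_abs_difference(scale=1.7, lo=-6.0, hi=6.0, steps=2401):
    """Largest gap between logistic and normal CDF on an evenly spaced grid."""
    phi = NormalDist().cdf
    grid = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return max(abs(logistic_cdf(t, scale) - phi(t)) for t in grid)


def indifference_mean(standard, total_var, loss_ratio):
    """Posterior mean solving 1.7 * (mean - standard) / sqrt(total_var) = log L.

    Use total_var = phi for a fixed standard and phi + phi_0 for an
    uncertain standard with mean `standard`.
    """
    return standard + math.sqrt(total_var) * math.log(loss_ratio) / 1.7


if __name__ == "__main__":
    print("max |logistic - normal|:", round(max_abs_difference(), 4))
    print("fixed standard:", round(indifference_mean(1.107, 0.006, 2.0), 4))
    print("uncertain standard:", round(indifference_mean(1.107, 0.0085, 2.0), 4))
```

For a loss ratio of one, `math.log(loss_ratio)` is zero and both indifference points coincide with the standard, as noted earlier in the text.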



From (8a) it is easy to obtain boundaries of indifference regions by
substituting $\hat{\gamma}$ from (2) for several values of g(x). For a
particular value of x Equation (8a), with log L as a function of $\gamma_0$,
defines the boundary between the region where x is the optimal cutting score
and the region where x+1 is the optimal cutting score; one may verify this by
substituting a 'greater than' sign in Equation (8a) for passing. In Figure 2
regions are given for $1 \le L \le 3$ and
$\mu_0 - 2\phi_0^{1/2} \le \gamma_0 \le \mu_0 + 2\phi_0^{1/2}$.
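A robustness grid of this kind can be sketched in Python as follows. The helper names are illustrative; the data values are the rounded figures quoted for the Mellenbergh et al. example, and the grid limits assume a standard of 1.107 with variance .0025.

```python
import math


def posterior(x, n, mu_gamma, phi_gamma):
    """Kelley estimate and posterior variance for raw score x."""
    err = 1.0 / (4 * n + 2)
    rho = phi_gamma / (phi_gamma + err)
    g = math.asin(math.sqrt((x + 0.375) / (n + 0.75)))
    return rho * g + (1.0 - rho) * mu_gamma, rho * err


def boundary_log_l(x, gamma_0, n, mu_gamma, phi_gamma):
    """Value of log L on the indifference boundary for score x and standard gamma_0."""
    mean, var = posterior(x, n, mu_gamma, phi_gamma)
    return 1.7 * (mean - gamma_0) / math.sqrt(var)


def cutting_score(gamma_0, loss_ratio, n, mu_gamma, phi_gamma):
    """Lowest x passing under the 'greater than' form of the boundary equation."""
    for x in range(n + 1):
        if boundary_log_l(x, gamma_0, n, mu_gamma, phi_gamma) > math.log(loss_ratio):
            return x
    return None


if __name__ == "__main__":
    # Assumed grid: standard within two standard deviations (0.05 each) of 1.107.
    for loss_ratio in (1.0, 2.0, 3.0):
        row = [cutting_score(1.107 + k * 0.05, loss_ratio, 19, 1.128, 0.011)
               for k in (-2, -1, 0, 1, 2)]
        print(loss_ratio, row)
```

Each printed row lists the optimal cutting score for increasing values of the standard at one loss ratio, which is the information the regions in the figure convey.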

Conspicuous is the sensitiveness of the optimal cutting score to changes in
$\gamma_0$ due to the unreliability of the test, i.e. the strong regression
of $\hat{\gamma}_p$ to $\mu_\gamma$.

Since the result is disappointing, one should put a lot of effort in
diminishing the uncertainty with respect to $\gamma_0$. For example, supposing
that only one expert has been used in setting the standard and that
$\phi_0$ reflects the uncertainty with respect to the resulting standard, one
may reduce the uncertainty by having the standard set by more than one
expert. The underlying critical assumption is that expert opinion is
unbiased (Jaeger, 1979).

Figure 2. Robustness regions for $\gamma_0$ and L. Numbers in the figure
give optimal cutting scores.

One may wonder what would have happened if a more realistic loss function
than threshold loss had been used instead. Take for instance linear loss
with respect to failing

$l_{fail}(\gamma) = 0$
$l_{pass}(\gamma) = -a(\gamma - \gamma_0)$,   a > 0.

Here the optimal cutting score is $\hat{\gamma}_p = \gamma_0$ and a change
$\Delta\gamma_0$ corresponds to a change $(1/\rho)\Delta\gamma_0$ in the
optimal cutting score g (treating g as a continuous variable). With respect
to robustness to changes in $\gamma_0$ there is no difference with threshold
loss. However, there might be a different interpretation of the seriousness
of a change in the cutting score due to the characteristics of the linear
loss function, which weights wrong classifications of borderline examinees
less heavily.
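The claim that both loss structures react identically to a shift in the standard can be illustrated with a small Python sketch. It treats g as continuous and uses the logistic form of the threshold criterion; the function names and the `linear` flag are hypothetical.

```python
import math


def reliability(phi_gamma, n):
    """Regression weight rho of the Kelley estimate."""
    return phi_gamma / (phi_gamma + 1.0 / (4 * n + 2))


def cut_on_g_scale(gamma_0, loss_ratio, mu_gamma, phi_gamma, n, linear=False):
    """Continuous cutting point on the g scale.

    linear=True:  the posterior mean must equal gamma_0 (linear loss).
    linear=False: the posterior mean must satisfy the logistic indifference
                  equation for threshold loss.
    """
    rho = reliability(phi_gamma, n)
    target = gamma_0
    if not linear:
        post_sd = math.sqrt(rho / (4 * n + 2))
        target += post_sd * math.log(loss_ratio) / 1.7
    # Invert the Kelley estimate: mean = rho * g + (1 - rho) * mu_gamma.
    return (target - (1.0 - rho) * mu_gamma) / rho
```

Shifting `gamma_0` by some amount moves the returned cutting point by that amount divided by rho under either setting, because the extra threshold term does not depend on the standard.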

The use of normative information in standard setting

In the previous section I have demonstrated that, if there is a range of
plausible $\gamma_0$ values, there are several alternative cutting scores.
Here I will try to demonstrate that normative information may be used in
standard setting; the use of extra information reduces the range of
plausible $\gamma_0$ values. A proposal based on another philosophy has been
presented by Hofstee in this symposium.

Clearly both instruction and examination standards should be geared to the
level of the target group of examinees, i.e. their entrance characteristics
in a broad sense. If performance with respect to a given standard is
relatively low, one may enlarge the percentage of masters by lengthening the
instructional period, but this is only done with diminishing returns. In
such a situation, lowering the standard could be the more realistic action.

In this way the actual examinee results play a role in standard setting.
One could say with Hofstee (1973) that realistic standards in the end are
normative. A similar point of view is taken by Shepard (1979).

At first sight normative ideas look incompatible with the philosophy of
criterion-referenced testing. The use of normative information, however,
does not mean the introduction of an arbitrary group of persons and an
arbitrary passing percentage defined for this group. Neither does it mean
that the standard varies with the group of examinees, a procedure which
correctly is rejected in criterion-referenced measurement. Here only the
claim is made that normative information can be useful in reducing
uncertainty in standard setting.

In practice it may be difficult to make a distinction between normative and
absolute information of performance. For instance, information on
performance in comparable or subsequent instructional programs can be useful
in standard setting. But the evaluation of performance depends on arbitrary
decisions with respect to the standard in these programs, possibly based on
a mixture of normative and absolute arguments. For this reason Glass (1978)
criticized Huynh's (1976) referral loss as bootstrapping on other criterion
scores. Nevertheless, more or less vague ideas concerning the percentage of
examinees who have a satisfactory performance exist; it is on this basis
that Shepard (1979) suggests the correction of the standard if the
percentage of failures seems inordinate. In my opinion such feelings should
be formalized beforehand, so that we may state a priori for all possible
outcomes on an examination which decision should be made. Such a procedure
is more satisfactory than a simple correction by hindsight.

Assume that the vague normative knowledge can be formalized in a density.
Specifically let $F' = \log(F/(1-F))$, where F is the proportion of examinees
thought to be satisfactory, be $N(\mu_{F'}, \phi_{F'})$; F itself has
approximately a beta distribution.

Further, let $\gamma_0$ be $N(\mu_0, \phi_0)$ as in the previous section and
let us assume that $\gamma_0$ and $F'$ are independently distributed.
Equiprobability contours of the bivariate distribution of $\gamma_0$ and $F'$
are given by

(9)   $\phi_0^{-1}(\gamma_0 - \mu_0)^2 + \phi_{F'}^{-1}(F' - \mu_{F'})^2 = \text{constant}$.

Assuming that the cumulative distribution of $\gamma$ can be approximated by
the cumulative logistic, we obtain the following relation between
transformed domain score and transformed proportion of examinees exceeding
the transformed domain score

(10)   $F' = -1.7 \, \phi_\gamma^{-1/2} (\gamma - \mu_\gamma)$.

The line defined by (10) is tangent to one of the ellipses defined by (9).
The point of contact defines values of $\gamma_0$ and $F'$ satisfying (10)
having the highest joint probability; this value of $\gamma_0$ is the new
standard.

It is easy to deduce that points of contact for equations (10) differing
only in the value of $\mu_\gamma$ lie on a line through
($\mu_0$, $\mu_{F'}$) with a slope equal to
$(\phi_\gamma^{1/2}/1.7)(\phi_{F'}/\phi_0)$.

In Figure 3 the procedure is demonstrated for the Mellenbergh et al. data.
Here $\mu_0 = 1.107$, $\mu_{F'} = .847$ (corresponding to a proportion F equal
to .70) and $\phi_{F'}/\phi_0 = 100$. In the figure the line is drawn
connecting optimal points ($F'$, $\gamma_0$) for distributions having the same
variance on the transformed scale as the Mellenbergh et al. data. The
intersection of the two lines in the figure gives the corrected standard
($\gamma_0 = 1.083$). For this standard the optimal cutting score, given
L = 2, is 16 instead of 17. Figure 4 gives the relationship between F and the
standard for distributions having the same $\phi_\gamma$ as in the example;
the relation between F and the standard in this case is surprisingly linear
in the range of interest.
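The point of contact can also be found algebraically by intersecting the logistic line with the line of contact points. The Python sketch below does this under the stated assumptions; the function name and argument names are illustrative and not from the paper.

```python
import math


def corrected_standard(mu_0, mu_f, var_ratio, mu_gamma, phi_gamma):
    """Point where the logistic line touches an equiprobability ellipse.

    var_ratio is the ratio of the variance of F' to the variance of the
    standard. Returns the pair (gamma_0, F').
    """
    b = 1.7 / math.sqrt(phi_gamma)   # minus the slope of the logistic line
    s = var_ratio / b                # slope of the line of contact points
    # Solve -b * (gamma_0 - mu_gamma) = mu_f + s * (gamma_0 - mu_0).
    gamma_0 = (b * mu_gamma + s * mu_0 - mu_f) / (b + s)
    return gamma_0, -b * (gamma_0 - mu_gamma)


if __name__ == "__main__":
    g0, f_logit = corrected_standard(1.107, math.log(0.7 / 0.3), 100.0, 1.128, 0.011)
    print(round(g0, 3), round(1.0 / (1.0 + math.exp(-f_logit)), 3))
```

With the rounded values quoted in the text this sketch gives a standard of about 1.084, close to the 1.083 reported above; the small difference may stem from rounding of the inputs.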

One may choose the joint distribution of $\gamma_0$ and $F'$ in such a way
that for a specific distribution of $\gamma$ predetermined values of
$\gamma_0$ and $F'$ are obtained. This is very useful when a new program is to
replace an old one. In case the unfortunate outcome is that the new program
did not result in a learning change, one supposedly would like to stick to
the old value of $\gamma_0$. If, however, the mean score remains the same, but
the variance changes, it is not possible to have the same $\gamma_0$ using
this procedure. If one would like to keep the old standard in this case,
another procedure is in order; such a proposal has been made earlier (De
Gruijter, 1978) using prior information with respect to the population mean
instead of prior information with respect to $F'$.

Two remarks remain to be made. First, I have assumed that the population
distribution of $\gamma$ is known while in practice only sample information
is available and sometimes only after more testing occasions population
parameters can be accurately estimated by combining the data. Secondly, in
the proposal it is assumed that information with respect to the proportions
of satisfactory examinees is available, while Shepard defines the problem in
terms of proportion passing. The use of proportion satisfactory is
preferable since proportion passing also depends on test length.

Examinations varying in difficulty level

The binomial error model can be applied in case every examinee has to
answer a different set of items randomly chosen from the item domain or in
case items have approximately the same difficulty level. Otherwise, the
compound binomial error model may be appropriate.

The use of tests composed of items varying in difficulty level, presents
some complications: if items vary in difficulty level, so will the tests.
This means that the standard appropriate for a particular test is defined
in terms of the relative true score scale of this test and that this
standard may differ to an unknown extent from the standard on the domain
score scale - where the domain score is the expected relative true score with
expectation taken over all possible test forms - and the standards on other
test forms.

Figure 3. Transformed percentage satisfactory as a function of $\gamma_0$ and
the line defining points of contact for distributions with identical
$\phi_\gamma$.

Figure 4. The relationship between the standard and F for population
distributions with identical $\phi_\gamma$.

Interestingly enough, if the groups of examinees taking different test
forms are all large and may be considered random samples from the
population of examinees, standards should be transferred from one test form
to another by a relative procedure. This again is an example where a relative
approach fits into the framework of criterion-referenced measurement (De
Gruijter, 1978). A critical assumption is the randomness assumption; it
does not hold if people react to the mean level of the group of persons
with which they study. However, the coherence of large groups (probably
consisting of many small groups) is small while, further, mean levels of
different large groups do not differ much in case there are no systematic
factors effecting differences.



Summary

It is argued that a full decision-theoretic approach to criterion-
referenced measurement should incorporate uncertainty with respect to the
standard. Furthermore, it is demonstrated that normative information in
itself is not incompatible with the idea of criterion-referenced
measurement. On the contrary, normative information may be used in order to
determine more precisely a satisfactory standard.